First part of this notebook based on Karthik Ram’s GGPLOT2 Lecture (CC-By 2.0)

GOALS: Students should be able to use ggplot2 to generate publication quality graphics and understand and use the basics of the grammar of graphics.

DataViz

Terminology:

Think in terms of variable types

Organizing your project space

R gives us a concept of a project and it is good practice to create one when you start a project.

  1. Go to File > New project
  2. Select New directory > R project
  3. Name the new directory data-viz, create it on your Desktop, and click OK
  4. In the lower-right pane, add a New Folder called data

We are ready to start.

Notebooks in R - In-line Viz in dynamic docs

  1. Show the difference between R scripts and notebooks
  2. Some basic R Markdown
  3. How to play with visualizations: interactive tinkering

Loading Data in R

We can load data into R via multiple mechanisms:

Datasets from base-R and packages

  • Can be part of packages or base-R – these are often set up for teaching R or as examples for how to use a package. How do you find out what’s there? Use data(). More datasets will show after package installation and loading.
data()
library(openintro)
Please visit openintro.org for free statistics materials

Attaching package: 'openintro'
The following objects are masked from 'package:MASS':

    housing, mammals
The following object is masked from 'package:ggplot2':

    diamonds
The following objects are masked from 'package:datasets':

    cars, trees
data()
  • We can load one of these dataframes into our current session by using data():
data(diamonds)
head(diamonds)
  carat       cut color clarity depth table price    x    y    z
1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
#View(diamonds)
  • Note: these are loaded into the R session but aren’t saved as a file on our file system.
  • If you want to save one of these as a csv, you can write it out using write_csv() from the readr package
library(readr)  # provides write_csv() and read_csv()
write_csv(diamonds, 'data/diamonds.csv')

Reading data from the filesystem

Download from the web

  • To demonstrate this, let’s first download a csv using download.file()
  • We use the URL to a data file – the full url including the filename for the first argument
  • For the second argument, add the path plus / and the file name you want to save as.
#download.file(url, path/filename)
download.file("https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv", 'data/gapminder-FiveYearData.csv')

Read the data into R

  • There’s a family of read functions that read data in as a data frame
read_csv('data/gapminder-FiveYearData.csv')
Parsed with column specification:
cols(
  country = col_character(),
  year = col_integer(),
  pop = col_double(),
  continent = col_character(),
  lifeExp = col_double(),
  gdpPercap = col_double()
)
# A tibble: 1,704 x 6
   country      year      pop continent lifeExp gdpPercap
   <chr>       <int>    <dbl> <chr>       <dbl>     <dbl>
 1 Afghanistan  1952  8425333 Asia         28.8       779
 2 Afghanistan  1957  9240934 Asia         30.3       821
 3 Afghanistan  1962 10267083 Asia         32.0       853
 4 Afghanistan  1967 11537966 Asia         34.0       836
 5 Afghanistan  1972 13079460 Asia         36.1       740
 6 Afghanistan  1977 14880372 Asia         38.4       786
 7 Afghanistan  1982 12881816 Asia         39.9       978
 8 Afghanistan  1987 13867957 Asia         40.8       852
 9 Afghanistan  1992 16317921 Asia         41.7       649
10 Afghanistan  1997 22227415 Asia         41.8       635
# ... with 1,694 more rows
  • Calling read_csv() by itself only prints the data frame
  • To reuse the data, we need to save it as an object in R
gapminder <- read_csv('data/gapminder-FiveYearData.csv')
Parsed with column specification:
cols(
  country = col_character(),
  year = col_integer(),
  pop = col_double(),
  continent = col_character(),
  lifeExp = col_double(),
  gdpPercap = col_double()
)
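
The column specification that read_csv prints can also be stated up front. The sketch below is only an illustration: it writes a tiny made-up CSV to a temporary file (the three rows and the object name small are assumptions, not part of the lesson data) and reads it back with explicit column types, which also silences the "Parsed with column specification" message.

```r
library(readr)

# Write a tiny hypothetical CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("country,year,lifeExp",
             "Afghanistan,1952,28.8",
             "Afghanistan,1957,30.3"), tmp)

# Declare the column types instead of letting read_csv guess them,
# and assign the result so the data frame can be reused
small <- read_csv(tmp, col_types = cols(
  country = col_character(),
  year    = col_integer(),
  lifeExp = col_double()
))

str(small)
```

Declaring types this way makes a surprise (for example, a numeric column read as text) show up as a parsing problem at load time rather than later in a plot.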

First Plots with GGPLOT

This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

library(ggplot2)
data(iris)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()

Basic structure

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()

myplot <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))
myplot + geom_point()

Increase size of points

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(size = 3)

Make it colorful

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(size = 3)

Differentiate points by shape

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(aes(shape = Species), size = 3)

Exercise 1

# Make a small sample of the diamonds dataset
d2 <- diamonds[sample(1:dim(diamonds)[1], 1000), ]

Then generate this plot below. (open 09-plot-ggplot2-ex-1-1.png)

ggplot(d2, aes(carat, price, color = color)) + geom_point() + theme_gray()

Switch to Gapminder Data

library(readr)  #from tidyverse
#download.file("https://goo.gl/BtBnPg", 'data/gapminder-FiveYearData.csv')
#gapminder <- read.csv("https://goo.gl/BtBnPg")
gapminder <- read_csv('data/gapminder-FiveYearData.csv')
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

NOTE:

  • First we call the ggplot function. Any arguments we provide to the ggplot function are considered global options: they apply to all layers
  • We passed two arguments to ggplot:
      • data: the data frame to plot
      • an aes function, which tells ggplot how variables map to aesthetic properties, here the x & y locations

On its own, the ggplot call sets up the plot but does not draw any data.

ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap))
## If run, current versions of ggplot2 draw empty axes with no data; older releases stopped with an error.

We need to tell ggplot how we want to present the variables by specifying a geom layer. In the above example we used geom_point to create a scatter plot.

ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
  geom_point()
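
To make the point about global options concrete, here is a minimal sketch using the built-in iris data rather than gapminder, so it runs without any downloads. The object names p_global and p_layer are made up for this example. It compares a color mapping given in ggplot() with the same mapping given only inside one layer.

```r
library(ggplot2)

# Global mapping: color = Species applies to every layer,
# so geom_smooth() fits one line per species
p_global <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Layer-level mapping: color is mapped only inside geom_point(),
# so geom_smooth() sees no grouping and fits a single line
p_layer <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point(aes(color = Species)) +
  geom_smooth(method = "lm", formula = y ~ x)

p_global
p_layer
```

Where a mapping is placed is a design choice: put it in ggplot() when every layer should share it, and inside a single geom when only that layer should use it.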

Box plots

See ?geom_boxplot for a list of options

library(MASS)
head(birthwt)
   low age lwt race smoke ptl ht ui ftv  bwt
85   0  19 182    2     0   0  0  1   0 2523
86   0  33 155    3     0   0  0  0   3 2551
87   0  20 105    1     1   0  0  0   1 2557
88   0  21 108    1     1   0  0  1   2 2594
89   0  18 107    1     1   0  0  1   0 2600
91   0  21 124    3     0   0  0  0   0 2622
ggplot(birthwt, aes(factor(race), bwt)) + geom_boxplot()

ggplot(birthwt, aes(factor(race), bwt)) + geom_boxplot() +
  scale_x_discrete(labels = c("white", "black", "other"))

Histograms

See ?geom_histogram for a list of options

h <- ggplot(faithful, aes(x = waiting))
h + geom_histogram(binwidth = 30, colour = "black")

h <- ggplot(faithful, aes(x = waiting))
h + geom_histogram(binwidth = 8, fill = "steelblue", colour = "black")

Line plots

Download data

#download.file('https://raw.github.com/karthikram/ggplot-lecture/master/climate.csv', 'data/climate.csv')
#climate <- read.csv(text = RCurl::getURL("https://raw.github.com/karthikram/ggplot-lecture/master/climate.csv"))
climate <- read_csv("data/climate.csv")
Warning: Missing column names filled in: 'X1' [1]
Parsed with column specification:
cols(
  X1 = col_integer(),
  Source = col_character(),
  Year = col_integer(),
  Anomaly1y = col_character(),
  Anomaly5y = col_character(),
  Anomaly10y = col_double(),
  Unc10y = col_double()
)
ggplot(climate, aes(Year, Anomaly10y)) +
  geom_line()

We can also plot confidence regions

A confidence interval gives a range of plausible values for a parameter. It depends on a specified confidence level, with higher confidence levels corresponding to wider confidence intervals and lower confidence levels corresponding to narrower confidence intervals. Common confidence levels include 90%, 95%, and 99%.

Usually we don’t just begin chapters with a definition, but confidence intervals are simple to define and play an important role in the sciences and any field that uses data. You can think of a confidence interval as playing the role of a net when fishing. Instead of just trying to catch a fish with a single spear (estimating an unknown parameter by using a single point estimate/statistic), we can use a net to try to provide a range of possible locations for the fish (use a range of possible values based around our statistic to make a plausible guess as to the location of the parameter).
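
As a small worked example of the idea, the sketch below computes a confidence interval for a mean by hand. It assumes a t-based interval for the mean waiting time in the built-in faithful data set; this is an illustration only and is not how the uncertainty column in the climate data was produced.

```r
# Illustration: 95% confidence interval for the mean waiting time
x <- faithful$waiting
n <- length(x)
xbar <- mean(x)                 # point estimate (the "spear")
se <- sd(x) / sqrt(n)           # standard error of the mean

# t-based interval: estimate +/- critical value * standard error (the "net")
crit_95 <- qt(0.975, df = n - 1)
ci_95 <- c(lower = xbar - crit_95 * se, upper = xbar + crit_95 * se)
ci_95

# A higher confidence level uses a larger critical value,
# so the interval gets wider
crit_99 <- qt(0.995, df = n - 1)
ci_99 <- c(lower = xbar - crit_99 * se, upper = xbar + crit_99 * se)
ci_99
```

The 95% interval computed this way should match the conf.int reported by t.test(x).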

We’ll set ymax and ymin to Anomaly10y plus or minus Unc10y (Figure 4-25):

ggplot(climate, aes(Year, Anomaly10y)) +
  geom_ribbon(aes(ymin = Anomaly10y - Unc10y, ymax = Anomaly10y + Unc10y),
              fill = "blue", alpha = .1) +
  geom_line(color = "steelblue")

Exercise 2

Modify the previous plot and change it such that there are three lines instead of one with a confidence band.

cplot <- ggplot(climate, aes(Year, Anomaly10y))
cplot <- cplot + geom_line(size = 0.7, color = "black")
cplot <- cplot + geom_line(aes(Year, Anomaly10y + Unc10y), linetype = "dashed", size = 0.7, color = "red")
cplot <- cplot + geom_line(aes(Year, Anomaly10y - Unc10y), linetype = "dashed", size = 0.7, color = "red")
cplot + theme_gray()

#theme_classic
#theme_bw()
#theme_minimal()

Gapminder line graph

A scatter plot is not the best way to visualize change over time. Let’s use a line plot instead.

# group = country draws one line per country ("by" is not a documented ggplot2 aesthetic)
ggplot(data = gapminder, aes(x=year, y=lifeExp, group=country, color=continent)) +
  geom_line()

ggplot(data = gapminder, aes(x=year, y=lifeExp, group=country, color=continent)) +
  geom_line() + geom_point()

ggplot(data = gapminder, aes(x=year, y=lifeExp, group=country)) +
  geom_line(aes(color=continent)) + geom_point()

Bar Plots

ggplot(iris, aes(Species, Sepal.Length)) +
  geom_bar(stat = "identity")

library(tidyr)
#df <- melt(iris, id.vars = "Species")
df <- gather(iris, variable, value, -Species )
ggplot(df, aes(Species, value, fill = variable)) +
  geom_bar(stat = "identity")

The heights of the bars commonly represent one of two things: either a count of cases in each group, or the values in a column of the data frame. By default, geom_bar uses stat = "count" (older ggplot2 releases called this stat = "bin"). This makes the height of each bar equal to the number of cases in each group, and it is incompatible with mapping values to the y aesthetic. If you want the heights of the bars to represent values in the data, use stat = "identity" and map a value to the y aesthetic.
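
The two meanings of bar height can be compared side by side. This is a minimal sketch with made-up toy data (the data frames obs and totals are hypothetical and exist only for this illustration).

```r
library(ggplot2)

# Hypothetical toy data: one row per observation
obs <- data.frame(group = c("a", "a", "a", "b", "b", "c"))

# Default stat: bar height = number of rows in each group (no y aesthetic)
p_count <- ggplot(obs, aes(group)) +
  geom_bar()

# Pre-summarised data: one row per group, heights stored in a column
totals <- data.frame(group = c("a", "b", "c"), total = c(3, 2, 1))

# stat = "identity": bar height = the value mapped to y
p_identity <- ggplot(totals, aes(group, total)) +
  geom_bar(stat = "identity")

p_count
p_identity
```

Both plots show the same bars. In ggplot2 2.2.0 and later, geom_col() is a shorthand for geom_bar(stat = "identity").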

Dplyr and Tidyr

These two packages are the Swiss army knives of R.

  • dplyr: filter, select, mutate
  • tidyr: gather, spread, separate

Let’s look at iris again.

iris[1:2, ]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
df <- gather(iris, variable, value, -Species ) 
df[1:2, ]
  Species     variable value
1  setosa Sepal.Length   5.1
2  setosa Sepal.Length   4.9
ggplot(df, aes(Species, value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge")
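
The dplyr verbs named in this section are not demonstrated in the notebook, so here is a short sketch of filter, select, and mutate on iris. The column name Sepal.Ratio and the object names are made up for this example. select is called as dplyr::select because MASS, which is loaded earlier in this notebook, also exports a function called select.

```r
library(dplyr)

# filter(): keep rows that match a condition
setosa <- filter(iris, Species == "setosa")

# select(): keep only some columns
# (dplyr:: prefix avoids a clash with MASS::select)
sepals <- dplyr::select(setosa, Species, Sepal.Length, Sepal.Width)

# mutate(): add a column computed from existing ones
# (Sepal.Ratio is a hypothetical name chosen for this example)
sepals <- mutate(sepals, Sepal.Ratio = Sepal.Length / Sepal.Width)

head(sepals, 3)
```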

Exercise 3

Using the d2 dataset you created earlier, generate the plot below. Take a quick look at the data first to see if it needs to be binned.

d2 <- diamonds[sample(1:dim(diamonds)[1], 1000), ]
ggplot(d2, aes(clarity, fill = cut)) +
  geom_bar(position = "dodge")

Exercise 4

clim <- read.csv('data/climate.csv', header = TRUE)
clim$sign <- ifelse(clim$Anomaly10y<0, FALSE, TRUE)
# or as simple as
# clim$sign <- clim$Anomaly10y >= 0
ggplot(clim, aes(Year, Anomaly10y)) + geom_bar(stat = "identity", aes(fill = sign)) + theme_gray()

Density Plots

ggplot(faithful, aes(waiting)) + geom_density()

ggplot(faithful, aes(waiting)) +
  geom_density(fill = "blue", alpha = 0.1)

ggplot(faithful, aes(waiting)) +
  geom_line(stat = "density")

Colors

aes(color = variable)   # map color to a column: goes inside aes()
color = "black"         # set one fixed color: pass it to the geom, outside aes()
# Or add it as a scale
scale_fill_manual(values = c("color1", "color2"))
library(RColorBrewer) 
display.brewer.all() 

Using a color brewer palette

#df  <- melt(iris, id.vars = "Species")
ggplot(df, aes(Species, value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Set1")

Manual color scale

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  facet_grid(Species ~ .) +
  scale_color_manual(values = c("red", "green", "blue"))

Transformations and statistics

ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap, color=continent)) +
  geom_point()

ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
  geom_point() + scale_y_log10()

ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
  geom_point() + scale_y_log10() + geom_smooth(method="lm")

pwd <- ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
  geom_point() + scale_y_log10() + geom_smooth(method="lm", size=1.5)
There are two ways to specify an aesthetic:

  1. Here we set the size aesthetic by passing it as an argument to geom_smooth.
  2. Earlier we used the aes function to define a mapping between data variables and their visual representation.
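
The difference between setting and mapping an aesthetic is easiest to see with color. The sketch below uses iris; the object names p_set, p_map, and p_slip are made up for this example.

```r
library(ggplot2)

# Setting: a fixed value passed outside aes() applies to the whole layer
p_set <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point(color = "steelblue", size = 3)

# Mapping: inside aes(), color varies with a column and gets a legend
p_map <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point(aes(color = Species), size = 3)

# Common slip: a constant inside aes() is treated as data, so the points
# are not drawn in steelblue; they get a default palette color and a legend
p_slip <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point(aes(color = "steelblue"), size = 3)

p_set
p_map
p_slip
```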

With iris data - smooth

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(aes(shape = Species), size = 3) +
  geom_smooth(method = "lm")

Within facet

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(aes(shape = Species), size = 3) +
  geom_smooth(method = "lm") +
  facet_grid(. ~ Species)

Multi-panel figures: faceting

starts.with <- substr(gapminder$country, start = 1, stop = 1)
az.countries <- gapminder[starts.with %in% c("A", "Z"), ]
ggplot(data = az.countries, aes(x = year, y = lifeExp, color=continent)) +
  geom_line() + facet_wrap( ~ country)

Modifying text

ggplot(data = az.countries, aes(x = year, y = lifeExp, color=continent)) +
  geom_line() + facet_wrap( ~ country) +
  xlab("Year") + ylab("Life expectancy") + ggtitle("Figure 1") +
  scale_colour_discrete(name="Continent") +
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

http://swcarpentry.github.io/r-novice-gapminder/08-plot-ggplot2#challenge-5

With iris, one panel per row (stacked in a single column)

#str(iris)
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  facet_grid(Species ~ .)

And one panel per column (side by side in a single row)

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  facet_grid(. ~ Species)

Or wrap your panels

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  facet_wrap( ~ Species)

Themes

Themes are a great way to define custom plots.

+theme()

See ?theme for more options

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(size = 1.2, shape = 16) +
  facet_wrap( ~ Species) +
  theme(legend.key = element_rect(fill = NA),
        legend.position = "bottom",
        strip.background = element_rect(fill = NA),
        axis.title.y = element_text(angle = 0))

#install.packages("ggthemes")
library(ggthemes)

Then add one of these themes to your plot

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(size = 1.2, shape = 16) +
  facet_wrap( ~ Species) +
  theme_solarized() +
  theme(legend.key = element_rect(fill = NA),
        legend.position = "bottom",
        strip.background = element_rect(fill = NA),
        axis.title.y = element_text(angle = 0)) 

How to save your plots

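# Save the last plot that was displayed
ggsave(filename = "~/path/to/figure/filename.png")
# Save a specific plot object (plot1 stands for a plot you created earlier)
ggsave(filename = "~/path/to/figure/filename.png", plot = plot1)
# Set the size (in inches by default)
ggsave(filename = "/path/to/figure/filename.png", width = 6, height = 4)
# The file extension selects the output format
ggsave(filename = "/path/to/figure/filename.eps")
ggsave(filename = "/path/to/figure/filename.jpg")
ggsave(filename = "/path/to/figure/filename.pdf")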
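
The calls above use placeholder paths. For a version that runs as written, here is a small sketch that saves to R's temporary directory. The names plot1, out_file, and iris-scatter.pdf are assumptions for this example; replace tempdir() with your own figures folder.

```r
library(ggplot2)

# Keep the plot in a variable so it can be passed to ggsave() explicitly
plot1 <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point()

# tempdir() is used only so the example runs anywhere
out_file <- file.path(tempdir(), "iris-scatter.pdf")

# width and height are in inches by default
ggsave(filename = out_file, plot = plot1, width = 6, height = 4)

# Confirm that the file was written
file.exists(out_file)
```

Passing plot = explicitly avoids relying on whichever plot happened to be displayed last, which matters when a notebook is re-run out of order.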

Resources:

This is just a taste of what you can do with ggplot2. RStudio provides a really useful cheat sheet of the different layers available, and more extensive documentation is available on the ggplot2 website. Finally, if you have no idea how to change something, a quick Google search will usually send you to a relevant question and answer on Stack Overflow with reusable code to modify!